AITopics | word segmentation

Glyce: Glyph-vectors for Chinese Character Representations

Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, Jiwei Li

Neural Information Processing Systems, Feb-12-2026, 01:04:26 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, classification task, representation, (13 more...)

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > Canada (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Khmer Spellchecking: A Holistic Approach

Kong, Marry, Buoy, Rina, Chenda, Sovisal, Taing, Nguonly

arXiv.org Artificial Intelligence, Nov-14-2025

Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.

artificial intelligence, machine learning, natural language, (20 more...)

2511.09812

Country: Asia > Cambodia (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Glyce: Glyph-vectors for Chinese Character Representations

Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, Jiwei Li

Neural Information Processing Systems, Oct-2-2025, 15:33:12 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (17 more...)

Country:

Europe (0.46)
Oceania > Australia (0.14)
North America > Canada (0.14)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.95)

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Hu, Yifan, Liang, Frank, Zhao, Dachuan, Geuter, Jonathan, Reddy, Varshini, Schmidt, Craig W., Tanner, Chris

arXiv.org Artificial Intelligence, Jun-23-2025

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
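
The first strategy in the abstract above relies on pointwise mutual information and left/right entropy. The following is a minimal sketch of those two statistics over a toy corpus, not the authors' implementation. The function names, the `<s>` and `</s>` edge markers, and the corpus format (a list of unsegmented strings) are all assumptions made for illustration.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of observed contexts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def branching_entropy(corpus, span):
    """Return (left_entropy, right_entropy) of `span` across `corpus`.

    High entropy on both sides suggests the span occurs in varied
    contexts, which is a common cue that it forms a coherent unit.
    """
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(span)
        while start != -1:
            end = start + len(span)
            # Hypothetical edge markers stand in for sentence boundaries.
            left[text[start - 1] if start > 0 else "<s>"] += 1
            right[text[end] if end < len(text) else "</s>"] += 1
            start = text.find(span, start + 1)
    return _entropy(left), _entropy(right)


def pmi(corpus, a, b):
    """Pointwise mutual information (in bits) of adjacent characters a, b."""
    chars = Counter(ch for text in corpus for ch in text)
    pairs = Counter(text[i:i + 2] for text in corpus for i in range(len(text) - 1))
    if pairs[a + b] == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_ab = pairs[a + b] / sum(pairs.values())
    p_a = chars[a] / sum(chars.values())
    p_b = chars[b] / sum(chars.values())
    return math.log2(p_ab / (p_a * p_b))
```

A pre-tokenizer built on these statistics would typically keep spans with high internal PMI and high boundary entropy together before BPE merges run. The thresholds for doing so are not given in the abstract and are left out here.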

large language model, machine learning, natural language, (16 more...)

2506.15889

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Goriely, Zébulon, Buttery, Paula

arXiv.org Artificial Intelligence, Jun-13-2025

Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.
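
The observation that prediction error peaks at the start of words can be turned into a simple boundary detector. The sketch below assumes a per-symbol surprisal sequence is already available from some trained model. The local-peak rule, the function names, and the example values are illustrative assumptions, not the paper's exact procedure.

```python
def boundaries_from_surprisal(surprisal):
    """Return indices where a word is predicted to start.

    Index 0 always starts a word. Any later index is a start if its
    surprisal is a local peak: higher than the previous value and
    higher than the next value (when a next value exists).
    """
    starts = [0] if surprisal else []
    for i in range(1, len(surprisal)):
        above_prev = surprisal[i] > surprisal[i - 1]
        above_next = i + 1 == len(surprisal) or surprisal[i] > surprisal[i + 1]
        if above_prev and above_next:
            starts.append(i)
    return starts


def segment(symbols, surprisal):
    """Split `symbols` (a string or list) at the predicted word starts."""
    starts = boundaries_from_surprisal(surprisal)
    ends = starts[1:] + [len(symbols)]
    return [symbols[s:e] for s, e in zip(starts, ends)]
```

For example, with made-up surprisal values `[3.0, 1.0, 0.5, 2.5, 0.8, 0.4]` for the symbols of `"thedog"`, the only interior peak is at index 3, so `segment` returns `["the", "dog"]`.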

large language model, machine learning, natural language, (20 more...)

2504.03338

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
North America > United States (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Zhang, Zihong, He, Liqi, Li, Zuchao, Zhang, Lefei, Zhao, Hai, Du, Bo

arXiv.org Artificial Intelligence, May-27-2025

Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA

large language model, machine learning, segmentation, (17 more...)

2505.19631

Country:

Europe (1.00)
North America > United States (0.46)
Asia > China > Hubei Province (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Parsing Through Boundaries in Chinese Word Segmentation

Chen, Yige, Li, Zelong, Yang, Changbing, Zhang, Cindy, Cady, Amandisa, Lee, Ai Ka, Zeng, Zejiao, Pan, Haihua, Park, Jungyeul

arXiv.org Artificial Intelligence, Mar-29-2025

Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.

artificial intelligence, compound, natural language, (16 more...)

2503.23091

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
Asia > China > Hong Kong (0.04)
North America > United States > Maryland > Baltimore (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Fang, Zhao, Wu, Liang-Chun, Kong, Xuening, Stewart, Spencer Dean

arXiv.org Artificial Intelligence, Mar-25-2025

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

large language model, machine learning, natural language, (15 more...)

2503.19844

Country:

Asia > China > Shanghai > Shanghai (0.26)
North America > United States > Illinois > Cook County > Chicago (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Using Context to Improve Word Segmentation

Hu, Stephanie, Guo, Xiaolu

arXiv.org Artificial Intelligence, Mar-13-2025

An important step in understanding how children acquire languages is studying how infants learn word segmentation. It has been established in previous research that infants may use statistical regularities in speech to learn word segmentation. The research of Goldwater et al. demonstrated that incorporating context in models improves their ability to learn word segmentation. We implemented two of their models, a unigram and a bigram model, to examine how context can improve statistical word segmentation. The results are consistent with our hypothesis that the bigram model outperforms the unigram model at predicting word segmentation. Extending the work of Goldwater et al., we also explored basic ways to model how young children might use previously learned words to segment new utterances.
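
To make the idea of unigram word segmentation concrete, here is a minimal sketch that finds the most probable split of an unsegmented string under fixed unigram word probabilities, using dynamic programming. This is a simplification and not the Bayesian models of Goldwater et al., which learn the lexicon rather than taking it as given. The function name, the `unknown_prob` fallback, and `max_len` are assumptions for illustration.

```python
import math


def viterbi_segment(text, word_probs, unknown_prob=1e-6, max_len=6):
    """Return the highest-probability segmentation of `text`.

    `word_probs` maps known words to unigram probabilities. Unknown
    single characters fall back to `unknown_prob` so that every string
    has at least one valid segmentation.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            fallback = unknown_prob if len(word) == 1 else 0.0
            p = word_probs.get(word, fallback)
            if p == 0.0:
                continue  # unknown multi-character span: not a candidate
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]
```

A bigram variant would condition each word's probability on the previous word, which is the kind of context the abstract reports as helpful. It is omitted here to keep the sketch short.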

boundary, phoneme, segmentation, (16 more...)

2503.10023

Country: North America > United States > New York (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation

Wang, Xuebin, Zhang, Lei, Li, Zhenghua, Zhou, Shilin, Gong, Chen, Hou, Yang

arXiv.org Artificial Intelligence, Dec-12-2024

Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from speech-text parallel data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.
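
The abstract does not spell out its probability-based filtering strategy, so the following is only one plausible reading: estimate, for each adjacent character pair, how often a pause was observed between the two characters, and keep a pause as a word boundary only when that estimate clears a threshold. The data format, function names, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter


def pause_probabilities(utterances):
    """Estimate P(pause | adjacent character pair) from aligned data.

    `utterances` is a list of (text, pauses) tuples, where `pauses` is a
    set of positions i meaning a pause fell between text[i-1] and text[i].
    """
    seen, paused = Counter(), Counter()
    for text, pauses in utterances:
        for i in range(1, len(text)):
            pair = text[i - 1:i + 1]
            seen[pair] += 1
            if i in pauses:
                paused[pair] += 1
    return {pair: paused[pair] / seen[pair] for pair in seen}


def reliable_boundaries(text, pauses, probs, threshold=0.5):
    """Keep only pauses whose character pair is usually separated by a pause."""
    return sorted(
        i for i in pauses
        if 0 < i < len(text) and probs.get(text[i - 1:i + 1], 0.0) >= threshold
    )
```

In this reading, a pause that appears between two characters only occasionally is treated as hesitation rather than a word boundary and is dropped before the boundaries are used as extra training data.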

artificial intelligence, machine learning, natural language, (21 more...)

2412.09045

Country:

North America > Canada > Quebec > Montreal (0.24)
North America > United States > Pennsylvania (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
